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Comparing and computing distances between phylogenetic trees are im- 
portant biological problems, especially for models where edge lengths play an 
important role. The geodesic distance measure between two phylogenetic trees 
with edge lengths is the length of the shortest path between them in the con- 
tinuous tree space introduced by Billera, Holmes, and Vogtmann. This tree 
space provides a powerful tool for studying and comparing phylogenetic trees, 
both in exhibiting a natural distance measure and in providing a Euclidean- 
like structure for solving optimization problems on trees. An important open 
problem is to find a polynomial time algorithm for finding geodesies in tree 
space. This paper gives such an algorithm, which starts with a simple initial 
■ path and moves through a series of successively shorter paths until the geodesic 

■rj- , is attained. 

cn 

^ ; Introduction 

A phylogenetic tree describes the evolutionary history of a set of organisms, with the 
leaf vertices representing the organisms and the interior vertices representing points 
at which the evolutionary history branches. Researchers use different criteria and 
methods for constructing phylogenetic trees from available data about the set of or- 
ganisms, which can result in several possible trees or a distribution of trees describing 
the phylogenetic history. For example, reconstructing the most likely tree for different 
genes may yield different trees [!§] ; different reconstruction methods can also produce 
different trees on the same set of organisms [11]. Thus a way is needed to quantita- 
tively compare different phylogenetic trees, by computing some metric describing the 
differences between them. Many such distance measures have been proposed, includ- 
ing the Robinson- Foulds or partition distance [H] , the Nearest Neighbor Interchange 
(NNI) distance p2], the Subtree-Prune-and-Regraft (SPR) distance [7], and the Tree 
Bisection and Reconnection (TBR) distance [2]. These measures tend to emphasize 
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the differences in topologies between the trees, and often do not account for edge 
lengths. If the edge lengths represent such information as number of mutations be- 
tween speciation events, we lose important information by ignoring them. Worst yet, 
most of these measures cannot be computed efficiently and so are of little use when 
applied to large trees. 

To address this issue, Billera et al. |1] propose the concept of a continuous tree 
space, and its associated geodesic distance metric, as a natural way to embed and 
compare phylogenetic trees. This tree space consists of a set of Euclidean regions, 
called orthants, one for each tree topology. Orthants are joined together whenever 
one tree topology can be made into another by exchanging edges between the trees. 
Within an orthant, the coordinates of each point represent the edge lengths for a 
particular tree with the topology associated with that orthant. The geodesic between 
two trees is the unique shortest path connecting the two associated points in this 
space. Thus traversing the geodesic corresponds to continuously transforming one tree 
into the other. In contrast to previous measures, geodesic distance incorporates in a 
natural way edge lengths as well as the tree topology. Furthermore, the uniqueness 
of the geodesic between any pair of trees and the continuity of the tree space suggest 
this framework has useful properties with respect to optimization over trees and to 
formulating statistical measures associated with trees ([9] and [ID]). Other versions of 
tree space with different metrics or no metric have been investigated in phylogenetics 
contexts ([8], [6], and p3] for example) and in combinatorial ones ([21] and [T6]). 

Two algorithms have been previously proposed for computing the geodesic dis- 
tance: GeoMeTree [12] and GeodeMaps [15]. Both these algorithms search 
through an exponential number of candidate paths to find the geodesic, so their 
run time is exponential in the size of the trees. Currently there are no known 
polynomial-time algorithms to find tree space geodesies, although a polynomial time 
V^-approximation of the geodesic distance was given by Amenta et al. [3]. Some 
combinatorial and geometric properties of the space of phylogenetic trees were also 
presented in [T5] . 

This paper presents the first polynomial-time method for computing geodesic dis- 
tances — and the associated geodesies — between trees in tree space. The algorithm 
uses a different approach from the previous papers, by starting with a simple path be- 
tween the trees and transforming it into successively shorter paths until the geodesic 
is obtained. At each step, the algorithm identifies one new orthant that intersects 
the geodesic, and transforms the current path so that it passes through this new or- 
thant in an optimal manner. By restricting consideration to the orthants intersecting 
the geodesic, the algorithm makes only a polynomial number of path transforma- 
tions. Each new orthant is identified by finding a weighted vertex cover in a specially 
constructed bipartite graph, which also is a polynomial time problem. 

Section [2] describes the tree space in which we define the geodesic distance, along 
with some important geometric and combinatorial properties relevant to finding the 
geodesic. Section [3] gives the geodesic path algorithm between trees with disjoint 
edge sets, and establishes its correctness and complexity. Section H] explains how 
to efficiently use the geodesic path algorithm when the trees have common edges. 
The final section outlines some interesting problems that extend the scope of this 
algorithm and the associated structures. 
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2 Tree Space and Geodesic Distance 



This section describes the continuous space of phylogenetic trees and the concept 
of geodesic distance. For further details, see pE]. A phylogenetic n-tree, or just n- 
tree, is a tree T = (A, £,£), where X = {0, 1, . . . , n} is a labeled set of vertices, 
called leaves, of degree 1, and £ is the set of interior (nonleaf) edges, such that each 
interior vertex of T has degree at least 3. The leaf is sometimes identified as the 
root of T, although we do not distinguish it here. Each interior edge e is given an 
associated non-negative length |e|, or \e\x if we want to emphasize the tree T to 
which e belongs. For now we do not attach lengths to the leaf edges of T, although 
the relevant properties of tree space apply as well when leaf edge lengths are present. 
At the end of Section [3] we extend our results to trees with leaf edge lengths. For our 
purposes it is most convenient to represent the topology of a tree T by its set £ of 
splits of the interior edges, where the split X e \X e associated with edge e represents 
the partition of X induced by removing the edge e from T. In order that these splits 
actually correspond to a tree, they must be compatible, that is, for every two edges e 
and /, one of the sets X e fl Xf, X e fl Xf, X e fl Xf, or X e n Xf is empty. A set of 
n — 2 compatible splits uniquely determines the topology of an n-tree [20l Theorem 
3.1.4]. Because of this correspondence, we henceforth identify edges in two trees if 
they correspond to the same split. 

Two example 5-trees are given in Figure [TJ One can verify that the six given edges 
are distinct, but that, for example, the edge e\ in T and the edge e§ in tree T' have 
compatible splits. 




ei: {0,1, 2, 5} | {3, 4} 
splits e 2 : {0,3, 4, 5} |{1,2} 

e 3 : {0,5} |{1,2,3,4} 




e 4 

splits es 



T 

{0,1, 4, 5} |{2,3} 
{0,1,2,3} |{4,5} 

{0,1}|{2,3,4,5} 



Figure 1: An example of two 5-trees. 
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2.1 Tree Space 



The geometric study of the continuous space of phylogenetic trees T n , or just tree 
space, was pioneered by Billera et al. in |EJ. Fix leaf set X of cardinality n + 1, where 
the element labeled is either another leaf or the root. In T n each n-tree topology is 
associated with a unique fc-dimensional Euclidean orthant (the non-negative part of 
IR fc ), where k is the cardinality of the set of edges in that tree topology. We denote 
the smallest orthant containing tree T = (X, £, £) by 0(T) = 0(S + ), where T is 
identified in (D(T) by the vector of lengths of its set £+ of positive- length edges. 
The interiors of the orthants are disjoint, and represent trees with the same topology 
but varying (positive) edge lengths. Thus the maximum- dimension orthants have 
dimension n— 2, which is the maximum number of interior edges of an n-tree. Orthants 
of lower dimension correspond to trees with fewer than n — 2 edges, and effectively 
identify the points on the boundary of the higher-dimensional orthants. In particular, 
we can consider a tree T with k positive-length edges to be on the boundary of any 
orthant of higher dimension for which some subset of edges in its corresponding tree 
topology can be contracted — equivalently, the length of the edges can be set to zero 
- to produce the tree T. Note that as a consequence of this property, each edge in 
T also appears in the tree topology of every orthant containing 0(T). 

For example, in Figure [2](a), trees 7\ and T[ are represented by distinct points in 
the same orthant, because they have the same topology but different edge lengths. 
Tree T 2 is represented in a different orthant; T\ and T 2 have the same edge e\, and 
so their orthants will be incident. In particular, the tree T 3 , with single interior edge 
ei, can be obtained from 7\ (T 2 resp.) by setting edges e 2 (e^ resp.) to — and thus 
is a point on the e\ axis common to 0(T\) and 0(T 2 ). 

In general, T n can be embedded in M. N , where N = 2 n — n — 2 is the number of 
possible splits on n + 1 leaves. However, as no point in T n has a negative coordinate 
in ~R N , we often let the positive and negative parts of an axis correspond to different 
splits. This can give a more compact representation of the orthants of interest in 
tree space. For example, Figure MJo) illustrates one way the 2-dimensional orthants 
of five tree topologies in T A can be embedded into IR 3 , by letting e 3 -e 5 and ei-e 4 be 
represented by the the same coordinates. 

2.2 Geodesic Distance 

The tree space T n has two important properties: 

1. T n is path- connected, so we can find a parameterized set T = {7(A) : < A < 1} 
of trees 7(A) G T n connecting any two n-trees. The simplest such path is the 
cone path pE], which consists of the straight line from the first tree to the origin 
and the straight line from the origin to the second tree. We can equivalently 
think of it as the path formed by contracting all of the edges of each tree at 
the appropriate constant rates. For any path T, denote its length to be L(T). 
For our purposes T will always be made up of a sequence of connected line 
segments, each within its own orthant, and so we can write L(T) as the sum of 
the Euclidean lengths of these segments. This provides a natural metric on T n 
by defining the distance d(T, T') between trees T and T' in T n to be the length 
of a shortest path in T n between T and T 1 . 
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2 3 

(a) Two adjacent orthants in Z4 (b) An embedding in R 3 



Figure 2: An example of adjacent orthants in Each orthant is labeled with its 
corresponding tree topology, and the dashed lines indicate the geodesies between the 
specified trees. 

2. [H Lemma 4.1] T n is CAT(O), or non-positively curved. This means, roughly 
speaking, that triangles in T n are "skinnier" than the corresponding triangles 
in Euclidean space. In particular, let X, Y, Z be any three points in T n and 
let W be any point on a shortest path from Y to Z. Then if we construct 
a triangle xyz in Euclidean space with edge lengths \xy\ = d(X,Y), \xz\ = 
d(X, Z) and \yz\ = d(Y, Z), and let w be the point on yz with \yw\ = d(Y, W), 
then d(X,W) < \xw\. 

As T n is CAT(O), there is a unique shortest path Y* between any two trees T and T' 
in T n . The path V* is called the geodesic, and the geodesic distance between T and T' 
is defined as d(T,T') = L(T*). Figure [2]^a) gives the geodesic (represented by dotted 
lines) between two trees in adjacent orthants. This is clearly the straight line between 
them. Figure EJ^b) gives three geodesies between trees with no edges in common. In 
this case the geodesic is either a cone path (as in the (T^TQ- and (T^Tg )-geodesic), 
or it goes through an intermediate orthant (as in the (T2,T3)-geodesic). Thus the edge 
lengths, as well as the tree topology, determine the intermediate orthants traversed 
by the geodesic. 

When the trees have more leaves, the situation becomes more complicated. For 
example, the geodesic between the two trees given in Figure [T] is illustrated in Fig- 
ure [3] by a progression of intermediate trees sampled at equidistant points along that 
geodesic. This geodesic crosses an orthant boundary at A = 1/3 and 2/3, with the 
intermediate leg corresponding to a tree topology containing only two interior edges. 
Two different representations of the sequence of orthants containing the geodesic are 
given in Figure HJ Figure HJa) represents the three relevant orthants embedded in M 3 , 
while Figure SJb) rotates the three orthants so that the geodesic is represented by 
a straight line. Figure H](c) gives the tree associated with each leg of the geodesic. 
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Figure 3: A sampling of the trees along the geodesic between the two trees given in 
Figure HJ 

(Points in the figure are labeled with respect to the coordinate system of the orthant 
containing them.) 

The definition of geodesic given here differs slightly from the classical definition 
in a way that is useful to elucidate. Call a path T a local geodesic if there exists 
some e > so that every subpath of T of length < e is the shortest path between 
its endpoints. The following result shows that in CAT(O) space this local condition 
is sufficient to determine the geodesic. This is also proved in more generality in [5j 
Prop. 1.4, Chap. II.l]. 

Lemma 2.1. In a CAT(O) space, every local geodesic is a geodesic. 

Proof. Let T be a local geodesic from point P to point Q with associated gauge e. 
Denote by T(X, Y) the portion of T between points X and Y on T. Choose disjoint 
points P = P ,P 1 ,...,P k = Q on Y such that L(r(P i _ 1 , P)) < e/2, i = l,...,k. 
Then by definition r(P i _ 1 , P»+i) is a geodesic for % — 1, . . . , k — 1. 

Now assume by induction that the portion r(P , Pg) is a geodesic for 1 < I < i. 
Then L(r(P ,P_i)) = d(P , P-i) by induction, and L(r(P i _ 1 , P i+1 )) = d{P^P i+1 ) 
by the choice of the p's. Construct the triangle po£>i-i£>i+i in Euclidean space as 
specified by the definition of a CAT(O) space, and let pi be the point on pi_\Pi + i with 
\Pi-iPi\ = rf(Pj_ l7 P^. Then d(P , Pi) < \poPi\- But by induction we also have that 
r(P ,Pj) is a geodesic, and so 

d(p , Pi) = L(r(p , Pi)) = L(r(p , p_x)) + L(r(p_!, p)) 

= rf(P ,P_i) + d(Pi-i,Pi) = \p Pi-i\ + \Pi-iPi\- 

This plus the triangle inequality gives |p Pi-i| + \Pi-iPi\ — \PoPi\- But the only way 
this could happen is if pi-i is on poPi, which in turn implies that Pi-\ must also be 
on poPi+i. Thus d(P , P+i) = d(P , P_i) + d(P_i, P+i), and so r(P , P i+1 ) is also a 
geodesic. This establishes the inductive step, and the lemma follows. □ 

It is this result which motivated the idea of this paper. Namely, we can find a 
geodesic between trees T and T' by starting with any (T, T')-path in T n , determining 
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(b) Geodesic in T3 (orthants rotated to form a straight line) 

Figure 4: Two representations of the geodesic between the trees in Figure [TJ The 
geodesic is represented as a dashed line, with the orthants for each topological type 
in the geodesic identified. 

whether it is a local geodesic, and if not, transforming it into a shorter (T, T')-path. 

We define the Geodesic Treepath Problem (GTP), to be the problem of finding 
the geodesic between two trees in T n . The remainder of the paper constructs a 
polynomial-time algorithm for solving GTP. 

2.3 The Path Space of a Geodesic 

Billera et al. [I] showed that the geodesic between T and T' is contained in a sequence 
of orthants, called a path space, satisfying certain properties. These properties were 
further clarified in [TS]. We summarize, from [T5t Section 4], the relevant properties 
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of the shortest path, or geodesic, through a particular path space. For all path spaces 
between T and T', the shortest of these path space geodesies will be the geodesic 
between T and T' . 

We start with some preliminary assumptions and definitions. For now we assume 
that T and T' are disjoint, that is, have no common edges. (We will show at the end 
of Section [3] how to handle common edges between T and T'.) We say that edge sets 
A C £ and B C £' are compatible if every pair of the splits associated with A in £ 
and B in £' are compatible, or equivalently, if A U B determines a unique n-tree. 

Let T = (X, £, £) and T' = (X, £', £') be disjoint n-trees, and let A = 
(Ai, . . . , Ak) and E = (Bi, . . . , B k ) be partitions of £ and £', respectively, such that 
the pair (A, E) satisfies the following property: 

(PI) For each i > j, Ai and Bj are compatible. 

Then for all 1 < i < k, B± U • • ■ U Bi U Ai + \ U • • • U Ak is a compatible set, and hence 
= 0{B\ U • • ■ U Bi U A; +1 U • • • U Afc) is an orthant in tree space. Furthermore, 

the union V = U^ =1 Cj of these orthants forms a connected space. We call V a path 

space, the pair (.4, E) its support, and the shortest (T, T')-path through "P the path 

space geodesic for V . 

Billera et al. proved the following result [H Proposition 4.1] (using the notation 

Ei = A i+ i U • • ■ U A k and F { = B x U • • • U B { for all 1 < i < k). 

Theorem 2.2. For disjoint n-trees T and T' , the geodesic between T and T' is a path 
space geodesic for some path space between T and T' . 

In [151 Section 4], the requirements for path spaces to contain a geodesic are made 
more explicit, and the construction of the actual path space geodesic is given. We 
summarize the results of this research (Proposition 4.1, Proposition 4.2, Corollary 
4.3, Theorem 4.4, and Theorem 4.10 of [15]) below. For set A of edges, we use the 

notation ||A|| = ^^2 eeA |e| 2 to denote the norm of the vector whose components are 
the lengths of the edges in A. 

Theorem 2.3. Let T = (X, £, S) and T' = (X, £', £') be two n-trees, and let T be the 
geodesic in T n between T and T' . Then T can be represented as a path space geodesic 
with support A = (A 1 , . . . , A k ) of £ and E = (£>i, . . . , B k ) of £' which satisfy (PI) 
plus the following additional property: 

(T3n\ \\Ai\\ < \\A 2 \\ ^ < \\A h \\ 

\^ Z > pry ^ wa\ - ••• s ra- 

We call a path space satisfying conditions (PI) and (P2) a proper path space, and the 
associated path space geodesic a proper path. 

The following theorem summarizes results from |15j . 

Theorem 2.4. Let T = (7(A) : < A < 1) be a proper path between T and T with 
support {A, E) . Then Y can be represented in T n with legs 
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where the points on each leg P are associated with tree Ti = (X, £ l , E J ) having edge 
set 

S i = B x U . . . U Bi U A i+1 U . . . U A k 

edge lengths 



e t 4 



and sp/rfs 



Furthermore, the length ofT is 



M±jM | e L eGfi- 



-X" e |X e e G A,- 



L(r) 



|A 1 ||,...j|A fc ||) + (||fl 1 ||,...,||fl Jfc | 
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Remark: It is easy to see that if any two adjacent support pairs in a proper path 
space have their ratios in (P2) equal, then combining them again results in a proper 
path space. That is, if (A, B) is as in Theorem I2.3[ and if 1^4 = p~|| f° r some 
1 < i < k, then (A',B'), where A = (A x , A h , A t U A i+ x, A i+2 , A k ) and B' = 
(Bi, B ix , BiUB i+ i, B i+2 , B k ), is also the support of a proper path space. Further, 
from the description given in Theorem 12 A\ the associated proper path Y does not pass 
through the interior of the deleted orthant, and hence will also be a proper path for 
the new path space. It follows that we can produce a path space for V for which all 
of the inequalities in (P2) are strict. It is shown in [T5l Section 4.2.1] that this in fact 
is a unique representation for V . In this paper, however, we find it more convenient 
to allow relaxed inequalities in defining proper paths. 

Example 1: The cone path between trees T and T' is the path space geodesic for 
the path space consisting of the two orthants containing the original trees, that is, 
A = {£} and B = {£'}■ This trivially satisfies (PI) and (P2), and the associated 
proper path is simply the union of the two straight lines connecting T and T' to the 
origin. 

Example 2: For the geodesic given in Figure [31 the associated path space shown in 
Figure H] consists of the starting orthant, the target orthant, and a single intermediate 
orthant of dimension two on edges {ei, e 6 }. Thus the support for this path space will 
be A = ({e 2 , e 3 }, {ei}) and B = ({e 6 }, {e 4 , e 5 }), which is proper since 

Pill _ ||(3,4)|| ^ 10 _ 114,11^ 



ll^ill 10 ll(3,4)|| \\B 2 

The coordinates (edge lengths) of the path space geodesic as it passes through the 
intermediate orthant can be ascertained from the representation in Figure |4]^b). Here 
the orthants have been positioned so that the geodesic through them is a straight 
line. This can be done using the isometric map presented in [TBI Theorem 4.4] from 
the shaded regions shown in Figure H^a) to M 2 , and maps the geodesic to the straight 
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line (||Ai||, ||A 2 ||) to (-||Bi||, -||-B 2 ||). The length of this line, which is also the length 
of the path, is 



L(T) 



(Il{e 2 ,e3}||,||{ ei }||) + (||{e 6 }||,||{e 4 ,e 5 



(||(3,4)||, 10) + (10, ||(3,4)||) 
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Theorem 12.31 does not completely characterize the geodesic, in that a path can 
be proper without being the geodesic. Consider the two trees T 2 and T 3 given in 
Figure [2jb) . The cone path between T 2 and T 3 is proper, but this path space does 
not contain the geodesic. It is necessary to add the orthant 0({e^, eA-) to this cone 
path space to get the proper path space containing the geodesic. 

To check whether we can add such an intermediate orthant to the current candi- 
date path space and shorten the proper path length, we need to check whether we can 
partition some support pair (Ai, BA into two support pairs, such that the addition of 
the new orthant again results in a proper path space. That is, we drop some subset 
of the edges in A4 and add a subset of the edges in Bi to enter the new orthant, and 
then drop and add the remaining edges to reach the original succeeding orthant. 

However, even if such an intermediate orthant exists, adding it to the path space 
may not result in a shorter proper path. For example, for the trees T' 2 ' and T£ in 
Figure [2](b), we could add orthant 0({e^,e4}) to the cone path space and obtain 
a proper path space, but the proper path for this space will be the same length 
(actually the same path) as it is for the original path space. What we need are 
additional conditions for determining whether adding a specified intermediate orthant 
will result in a shorter proper path. As the next result shows, these conditions in fact 
characterize a proper path as a geodesic. 

Theorem 2.5. A proper (T,T')-path T with support (A,B) satisfying (PI) and (P2) 
is a geodesic if and only if (A, B) satisfies the following additional property: 

(P3) For each support pair (Ai, Bi), there is no nontrivial partition C\ U C 2 
of Ai and partition D 1 U D 2 of Bi, such that C 2 is compatible with D\ 

IICill . IIC2H 



and < 



Di\\ ^ \\D 2 \\- 

Proof. First assume that (P3) does not hold. Then there exists some support pair 
(Ai,Bt), together with partition C\ U C 2 of Ai and partition D\ U D 2 of Be, such 
that C 2 is compatible with D\ and p^jj < Let Y be the point where Y passes 

through the intersection between Ot-\ and Oi. Suppose the sequence ratios look like 

< jAj-i] ^ || AH _ _ \\M _ _ \\AA\ < \\A j+1 \\ < 
ll-Bj-ill ||-Bi|| H-E^H \\Bj\\ \\Bj+i\\ 

with the (i — l) st and (j + l) st indices absent if % — 1 or j — k. Then the legs of V 
going through Oi^\ and Oj have point Y in common and have positive length. Let 
s 7^ Y 7^ t be points before and after Y on T in and Oj, respectively, and let 
T st be the portion of T between s and t, which by choice of s and t consists of two 



10 



straight lines connected at Y. We will construct a path T' st from s to t that is shorter 
than T st , and so T cannot be a geodesic. 

The trees corresponding to s and t have edges B\ U • • • U _Bj_i U A, +1 U ■ • • U 4 
in common, so by the remarks at the beginning of the subsection, these edges change 
their length uniformly between s and t. Thus we can restrict attention to the trees 
s', Y' , t' comprised of s, Y, t with their common edges contracted. The s' and t' have 
the edges A — A{ U • • • U Aj and B — Bi U • • ■ U Bj, respectively, and by Theorem 12.41 
the edges in A contract uniformly in T from s' to Y', and the edges in B expand 
uniformly in V from Y' to t'. Thus there is a positive real number c such that the 
length of any edge e £ A at s' is c- |e|-r, and a positive real number d such that the 
length of any edge / £ B at t' is d-|/|T'- 

Let Ci = AiU- • -UA^iUCi, (7 2 = C 2 UA e+1 U- • -UAj, Di = Bill- ■ -UB^UDt, and 

-D2 = D 2 U U • • • U Sj. Then by the properties of the sets involved, jj=^jj < jjj^jj- 

Let A' , b' ,0^,0^, d[, d' 2 be the edge sets A, B, C\, C 2 , D\, D 2 scaled by c or d to 
represent the edges at the points s' or t' . Since C 2 is compatible with D\, and 
g^} < gj implies that g < g}, then = ({C' 1 ,C' 2 }, is a proper 

path space between s' and t' . Thus Theorem 12.41 holds here as well. Let T' st be the 
proper path between s' and t' in V'. Then by Theorem 12.41 T' st passes through the 
relative interior of the orthants in V, but not through Y' . Using Equation ([T]), we 
get that 

L(T st ) = \\A'\\ + 0\\ and L(T' st ) = \\(\\C[\\, \\C' 2 \\) + (H^IUI^IDH 

Now H£t4 7^ j=$j together with the triangle inequality implies that L(T' st ) < L(T st ), 

and so T is not the geodesic between T and T' . 

Conversely, assume that V is not a geodesic. By Lemma 12.11 this is the case if 
and only if it is not locally shortest in tree space. If so, then this must also happen 
at some bend in V — i.e. intersection of orthants — such that we could cut through 
some additional orthants to shorten its length. So suppose Y is such a point, with Y 
bending at the intersection between and 0j. 

We first consider the simple case where ip'" 1 !! < 1144 < !| d' +1 |! , with the right or 

— 1|| H-Dlll H-Di + lH 

left inequality absent if % — 1 or % = k. Let s 7^ Y 7^ t be points before and after Y 
on T in Oi-i and Oi, respectively. Then the section of V between s and t is not the 
geodesic from s to t. Let V be the proper path space containing the geodesic from 
s to t, with support (A 1 , B'), where A' = (A[, A' k ,) and B' = (B[, B' k ,). By our 
remark, we can assume that V' satisfies (P2) with strict inequalities, and note that 
k' must be greater than 1. Let C\ = A[, C 2 = Ai\Ci, D\ — B[, and D 2 = Bi\Di, 
and set the length of each edge in C\, C 2 , D\ and D 2 to the length that that edge 
has in T or T' . Then (C\,C 2 ) partitions Ai and (Di,D 2 ) partitions Bi, and further, 
C 2 and D\ are compatible since V satisfies (PI). 

It remains to show that pM < jjy^- By the same argument as above, we have 
positive constants c and d such that for any edge e £ Aj, its length at s is c-|e|y, and 
for any edge / £ Bj, its length at t is d ■ |/|t'- Since (A',B') was chosen to satisfy 
(P2) with strict inequalities, we have ^[[^|| < ^211 ' an< ^ ^ us pM < |f§l as desired. 

Next we consider the case where Ht'' 1 !! < 1144 = • • • = 144 < ll^^ 1 !! , again with 

H-t>i-l II 1-°! II ll-Dj II \\£>r 1 
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the right or left inequality absent if % — 1 or j = k. By combining the Ai, . . . , Aj 
and Bi, . . . ,Bj as per the remark above, we can apply the simple case, so that there 
exist partitions Ci U C 2 of A = Ai U ■ • • U Aj and D ± U D 2 of B = Bi U ■ • • U Bj, such 

\\C II HO II \\A IP 

that C 2 is compatible with D\ and < po- Let x be the common value of p^jp, 

£ = i, . . . , j , and let = ||^ n Ci\\ 2 and 6^ = \\Bg n -D«|| 2 for z = 1, 2, £ = i, . . . ,j. 

Note that Ai H C*2 is compatible with B^DDi, and thus if (P3) holds, it must be that 

for all i < I < j , — ^ > , and hence 
bu hi 



°u bu + b 2 £ b 2 e 



But 



-4, 



= x. 



au + CL21 = IK^nd) u (A e nc L/i 

bit + hi \\(B e nD 1 )U(B e C]D 2 )\\ 2 \\B £ \\ 2 

Thus we have — ^ > x > for all % < ^ < j, so that 

bu b 2 £ 

HdH 2 _ HuL^ncj 2 _ >x> EL^ _ l|uL^nc 2 || 2 _ ||c 2 || 2 



Pill 2 Uu^nAH* EL^~ ~EL^ llu^nAall 2 p 2 || 2 

But this contradicts the property of Ci, C*2, -Di and D 2 , and thus we have found a 
partition of an individual pair (Ai,Be) also violating (P3), as desired. 

□ 

Example 2 (continued): For the example path given in Figures [3] and HI if we 

consider the cone path, then (P3) is violated by sets (C\,C 2 ) = ({e 2 , e^}, {ei}) and 
(Di,D 2 ) = ({ee}, {e4, es}), since the inequality in Equation (T5]) is strict. With the 
added orthant the resulting proper path becomes the geodesic, since there are no 
nontrivial partitions of either support pair, and hence (P3) holds. 



3 A Polynomial Algorithm to Solve the Geodesic 
Treepath Problem 

Theorem 12.51 characterizes when a given proper path T with support (A, B) is a 
geodesic by specifying when a local improvement can be made for the path. This 
suggests the following iterative improvement scheme for finding a geodesic between 
trees T and T'\ 

1. Begin with some proper (T, T')-path T with support (A°,B°). 

2. At each stage we have proper path T e having support (A , B e ) satisfying con- 
dition (PI) and (P2). Check to see if (A e ,B e ) also satisfies the condition (P3), 
and if not, create a new proper path T e+1 with support (A e+1 , B e+1 ) and having 
smaller length than T e . 

3. Continue until the geodesic is found. 
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We now proceed to implement this procedure. For our starting proper path, choose 
T° to be the cone path [I], having support ^4° = (£) and £>° = (£'). This sup- 
port vacuously satisfies condition (PI) and (P2), and the path simply corresponds to 
contracting T and T' uniformly to the origin. 

To perform the iterative step, we recast it as a problem on bipartite graphs. To 
do this, define the incompatibility graph G(A, B) between sets A C £ and B C £' 
to be the bipartite graph whose vertex set corresponds to A U B, and whose edges 
correspond to those pairs e G A and / G B such that the corresponding splits X e \X e 
and Xf\Xf are incompatible. An independent set in G(A,B) is any set of vertices 
having no edges of G(A, B) between them. The following lemma follows directly from 
the definition of compatibility between sets: 

Lemma 3.1. Two edge sets A C £ and B C £' are compatible if and only if they 
form an independent set in G(£,£'). 

We can use Lemma 13.11 to restate the problem of determining whether a support 
(A e , B l ) satisfies (P3) as follows: 

Extension Problem 

Given: Sets A C £ and B C £' 

Question: Does there exist a partition C\ U C2 of A and a partition D\ U D 2 of 
B, such that 

(i) C 2 U Di corresponds to an independent set in G(A, B), 
(U) m < JIM ? 

y u ) piii ^ iid 2 ii • 

Lemma 3.2. A proper path T with support (A , B e ) is a geodesic if and only if the 
Extension Problem has no solution for any support pair (A iy Bi) of (A , B e ). 

The proof follows immediately from Theorem 12.51 and the previous discussion. 

We next proceed to solve the Extension Problem. Since scaling will not affect (ii), 
we first scale the edge lengths in A and B so that \\A\\ = \\B\\ = 1. By squaring (ii), 
we get the equivalent condition 

1-||C 2 || 2 ||C 2 || 2 



\Di\\ 2 1-Pi| 



or 



nc 2 ir+piii 2 = $>r+]rm 2 >i. 

e€C 2 /€£>! 

Thus the Extension Problem reduces to that of finding an independent set in G(A, B) 
having sufficiently large total weight, where the vertices are weighted by the normal- 
ized squares of the edge lengths of A and B. 

Now note that the pair C2 and Di form an independent set in G(A, B) if and only 
if their complements C\ and D 2 form a vertex cover for G(A, B), that is, every edge 
of G(A, B) is incident to a vertex of either C\ or D 2 . Thus the Extension problem 
has a solution if and only if the min weight vertex cover for G(A, B) has weight 
||Ci|| 2 + H-D2II 2 < 1- (Note that a solution to the extension problem will necessarily 
result in a nontrivial cover, and hence nontrivial partitions (Ci,C2) and (Di,!^).) 
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Lemma 3.3. The Extension Problem can be solved in 0(n 3 ) time. 



Proof. The Min Weight Vertex Cover Problem can be solved on bipartite graphs by 
a simple extension of the max flow formulation of the Min Cardinality Vertex Cover 
Problem (see e.g. PQ, Section 12.3), using the vertex weights as capacities on the 
source and sink arcs. Maximum flows can be found in 0(n 3 ) time (see e.g. [I], 
Section 7.7). □ 

The solution to the Extension Problem also suggests what the new proper path 
T e+1 should look like. Namely, if the Extension Problem for support sets Ai and Bi 
results in a min weight cover by vertex sets C\ C A{ and D2 C Bi with complements 
C2 and Di, respectively, then we replace Ai and Bi in A and B by the ordered pairs 
(61,62) and (D\,D2). We summarize this in a formal algorithm. 

The GTP Algorithm 

Input: n-trees T = (X, £, E) and T = (X, £' , £') 
Output: The path space geodesic between T and T' 
Algorithm: 

Initialize: Form the incompatibility graph G(S,S') between T and T', and 
set r° to be the cone path between T and T' with support A = (£) and 
B° = (£'). 

Iterative step: At stage £, we have proper path T e with support (A e , B ) 
satisfying conditions (PI) and (P2). 

for each support pair (Ai,Bi) in (A e ,B e ), solve the Extension Problem 
on (Ai, Bi). Specifically, find a min weight vertex cover for the graph 
G(Ai, Bi) using vertex weights 



k| 2 



ll^ill 2 



eeAi 
e G B % 



if every min weight cover found above has weight > 1, then T e satisfies 
(P3), and hence is the geodesic between T and T' . 

else choose any min weight vertex cover C\U D2, C\ C Ai and D2 C Bi 

UQ ||2 

with complements C2 and D\, respectively, having weight L. 2 + 

ifep < 1. Replace A% and Bi in A 1 and B e by the ordered pairs 

(Ci, C 2 ) and (£>x, £) 2 ), respectively, to form new support (A e+1 , B e+1 ) 
with associated proper path T e+l . 

To establish the correctness of the GTP Algorithm, we need to verify that the resulting 
path is indeed proper, i.e. that (P2) holds. 

Lemma 3.4. At each stage of the GTP Algorithm, the associated path space satisfies 
property (P2). 
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Proof. The cone path is trivially proper, so now assume by induction that (P2) holds 
after stage £ — 1 of the algorithm. Let (A ,B ) be the support for com- 



prised of the £ support pairs (A\ , B\ 



(t-i) n (/-i)> 



(A 



(£-1) B (t-l) 



Since the A 



are dropped and the s are added in the order given by their index, we can rep- 

resent this with visually with a left-to-right ordering of the A^ _1 ^'s and Bjf^'s, as 
in Figure [5j Furthermore, since each support pair must be entirely con- 

tained in some support pair (Aj, B r -) at every stage r < £ of the algorithm, the added 
support pairs maintain the existing left-to-right order. Thus we can depict several 
partitions with different degrees of refinement in the same diagram for instructional 
purposes (see Figure [5]). Since (PI) holds at each stage, then there can be no edges 
of the incompatibility graph between an element of A\ and an element of any group 
B r - to the "left" of A-. 

J L 



Ui 



X 



lh 



A 



i-1 



Ai 

C\ I C 2 



1 L' 



V 



Vi 



Bi-! 



W 



D 1 I D 2 

Bi 

V 2 



Figure 5: The refinements of £ and £' referred to in the proof of Lemma [3.41 

Now suppose that at stage £ some support pair (Af~ l \ Bf~ 1 ^) returns a nontrivial 
solution to the Extension Problem, comprised of partitions (Ci,^) of Af ^ and 
(Di, D 2 ) of Bf~ x \ with C 2 compatible with D x and g < Dropping the 

superscript {£ — 1), we first show that if % > 1 then < I^tt- 

Let r < £ be the stage at which sets A^x and Ai (and hence also sets and Bj) 
are separated in the partition. That is, in stage r — 1 sets and Ai are in the same 
partition X = A^~ X \ and sets -Bj-i and Bi are in the same partition W = Bj . 
Then in stage r the minimum weight vertex cover JJ\ U V 2 is found associated with 
the Extension Problem on (X, W), creating partitions (U\,U 2 ) of X and (Vi, V 2 ) of 
W, with Ai-i eU^AiEUi, G Vi, and B t e V 2 . 

Consider the vertical lines L and V in Figure From the discussion above, there 
can be no incompatibility-graph edges from the right of L in T to the left of L in T", 
and hence (U\\Ai-\) U (V 2 U -Bj_i) is a vertex cover for G(X, W). Likewise there can 
be no incompatibility-graph edges from the right of V in T to the left of V in T", and 
hence {U\ U Ci) U (V2\-Di) is also a vertex cover for G(X, W). But because f/i U V 2 is 
a minimum weight vertex cover, then it must have weight no greater than either of 
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these covers. Hence 

ll^ill 2 , INI 2 < WAA^W 2 ||^ 2 u^-i|| 2 

ll^ill 2 + IW + INI 2 + INI 2 " llf/ill 2 + HW INI 2 + INI 2 

ll^ill 2 ll^ill 2 | INI 2 ll^-ill 2 

' II W + II W ll^ill 2 + II W INI 2 + INI 2 + INI 2 + INI 2 



and 



if ini 2 ii^uCiii 2 , \\v 2 \d, 

I - II t r MO IIt r 111 



i|| 2 + l« INI 2 + INI 2 " W + W INI 2 + 1« 



ll^lll 2 ll^lll 2 IN| 2 Pi" 2 



IW + IW IW + IW IN| 2 + |N| 2 INI 2 + INI 2 " 



By cancelling terms and cross-multiplying we get 

ll^-ill 2 < Will 2 + WU2W 2 < UdlF 



ll^-ill 2 " INI 2 + INI 2 " Pil 



2 ■ 



and the inequality follows. 

The argument that if i < £ then I^M < II p' +1 || is symmetric. As the other ratios 

H-D2II — H-Bi+iH J 

remained unchanged, we have (P2) satisfied after stage I as well, and the lemma 
follows. □ 

Example 3: Lemma 13.41 does not necessarily hold outside the context of the 
GTP algorithm. In particular, the algorithm may not work correctly if an arbi- 
trary proper path is chosen as the starting path. Consider tree T in Figure Q] and 
the tree V given by the three splits f x : {1, 3}|{0, 2, 4, 5}, f 2 : {1, 3, 4}|{0, 2, 5}, 
and fs : {1, 3, 4, 5}|{0, 2} which have lengths 4, 10, and 2, respectively. Then 
A = ({ei, e 2 }, {^3}) and B = ({/1, f 2 }, {/a}) is the support of a proper path be- 
tween T and T", since 

Pill ||(10,4)|| 3 _ p a || 



IISill 11(4,10)11 2 ||5 2 || 

Now (P3) fails for support pair ({ei, e 2 }, {/1, /2D, since / 2 is compatible with ei 
and = ^ < f = jfei)). The refinement .A' = ({e 2 }, {ej, {e 3 }) and B' = 

({/2}, {/1}; {/s}) indicated by the GTP Algorithm, however, is not the support of 
a proper path, because jjj^jj = ^ > § = NIMH " ms ^ eac i, the support of our new, 
shorter proper path is A" = ({e 2 }, {ei, e^}) and B" = ({f 2 }, {fx, fa}), which is not 
even a refinement of ({ei,e 2 }, {e^}), ({fi, f 2 }, {f^}). Note that when we start with 
the cone path for T , however, we obtain the optimal path after a single iteration of 
the algorithm. 

Theorem 3.5. The GTP Algorithm correctly solves GTP in 0(n A ) time. 
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Proof. Lemma 13.41 implies that each successful solution to the Extension Problem 
results in a proper path whose support has one more support pair, so that after at most 
n — 3 iterations the algorithm will be unable to find any further nontrivial solutions 
to the Extension Problem. It follows that (P3) is satisfied, and so by Theorem [23] the 
resulting path is the geodesic. Further, we need only solve the Extension Problem on 
newly created support pairs, since an extension for one support pair will not change 
the status of any other support pairs. Thus at most n — 3 vertex cover problems 
are be solved throughout the entire algorithm. The complexity of the algorithm then 
follows from Lemma [3.31 □ 

Note that in each iteration the new path satisfies L(T £+l ) < L(T e ). This is 
straightforward to show, although it does not have a direct bearing on the correctness 
of the algorithm, since the termination of the algorithm is determined only by T e 
satisfying property (P3). It does show, however, that the algorithm is a bona fide 
iterative improvement algorithm. 

4 The GTP algorithm with common edges and leaf 
edge-lengths present 

We finish the analysis by showing how to handle common edges, including the leaf 
edges, between the terminal trees. This expanded algorithm allows us to include 
lengths on leaf edges. It also allows leaves — including the root — to have degree 
greater than one, since setting the length of the associated leaf edge to contracts 
the leaf into a vertex with higher degree. 

To handle common edges, we first note from [151 Theorem 2.1] that if T and T' 
share common edges, then these common edges will be present in every tree on the 
geodesic, with their lengths changing uniformly between those of their starting and 
ending trees. This suggest the following procedure for dealing with common edges: 

1. Identify the set C of all common nonleaf edges in both trees. Also let L be the 
set of leaf edges. 

2. Bisect each edge e G C by adding midpoint v e . 

3. Separate T and T' at each of the vertices v e for e G C. By definition this will 
leave a collection of pairs of disjoint subtrees (T(£),T'(£)) of T and T', indexed 
by £ = 1, . . . , r and with each pair having identical sets of leaves. 

4. For each pair of trees in this collection, apply the GTP Algorithm. Let 
(Ai(£), . . .,A ki (£)) and (B^i), . . .,B kl (£)), I = 1, . . . , r, be the support for the 
associated paths. 

5. The composite path can be described as in Theorem I2.4[ with the following 
modifications: 

(a) For each A, the lengths of the edges in the tree Ti(£) associated with the 
pair (T(£),T'(£)) will be as given in Theorem 12.41 The A is common across 
all pairs. 
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(b) Each edge e G C reconnects the T(£)'s by reattaching them at v e , with the 
splits defined accordingly. 

(c) The length of each common edge e G C U L on the path is 



(1 - A)|e| r + A|e 



T'- 



(d) The length of T is 



L(T) 



:iiA(i)ii + ii5i(i; 

|K(r)|| + ||i3i(r) 
|ec| T - |ecL) 



11^(1; 



where \&c\ T an d |ec| T , are the vectors of the lengths of the common edges 
in the appropriate tree. 

Since the partitioning of the tree can be done in linear time, this will not increase 
the complexity of the algorithm. An implementation of this algorithm is available at 
|http : / / www, st at-or.unc.edu/ webspace / miscellaneous / pr ovan / treespa"ce| 



5 Conclusion 



This paper presents the first polynomial time algorithm for finding geodesies be- 
tween phylogenetic trees in tree space, as well as further characterizing properties 
of geodesies. This significantly increases the usefulness of the geodesic distance as 
a modeling tool, since the previous exponential algorithms essentially restricted the 
geodesic distance measure to trees with fewer than 50 leaves. 

We first note that the technique presented here also solves GTP in the case where 
there is a specific right-left ordering on the non-root leaves of the tree, or equivalently, 
where the tree must be planar with respect to a given clockwise ordering of the leaves. 
An example of such trees are binary search trees. This condition simply adds to the 
definition of a tree that the splits of the tree must be noncrossing, that is, for any 
split X e \X e there are no pairs V\, V2 G X e and ^3,^4 G X e which appear in clockwise 
order v\, V3, V2, V4. Since if T and T' both satisfy the noncrossing property property, 
and all of the splits in the intermediate trees on the geodesic between T and T' are 
made up of the splits of T and T", then these trees must also satisfy the noncrossing 
property, and so the geodesic for this case is the same as that for the unrestricted 
case. 

The properties and techniques given here potentially apply to the much wider 
range of problems and measures on trees that make use of the intrinsic Euclidean 
nature of tree space. For example, Nye [H] compares and groups trees through the 
idea of "medial trees" which serve as representatives for topological types within 
data sets of trees. Hillis et al. [8] also investigate tree sets using distance as the 
distinguishing feature to find statistical groupings and common features. Using a 
more Euclidean-related measure for dissimilarity could allow more powerful statistical 
techniques to be employed in these situations. 
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Billera et al. [I] look at the concept of medial trees in their paper by defining 
the centroid of a set of points in tree space. Their definition involves an iterative 
process that is based on finding a converging sequence of midpoints of geodesies 
between trees. The implementation of this would require a fast method of computing 
geodesies. Another way of thinking about a centroid in standard Euclidean space, 
though, is as the point of minimum sum squared distance to the trees. The framework 
for finding geodesies here naturally lends itself to finding centroids in this alternate 
sense as well, and could yield a more direct and efficient way of computing centroids. 

A further extension of the idea of centroid comes up in the development of object 
oriented data analysis (OODA) as it has been applied to trees [22]. This involves 
fitting a "line" to a set of trees in such a way as to minimize least-squares distances. 
The set of nearest points on this line can then be analyzed to yield statistical dis- 
criminators that can in turn isolate significant properties of the underlying objects. 
This can be done a second time, as a result gaining "second-order" information about 
the set of trees, and so on. Two problems with OODA have been (a) the difficulty 
in determining the right concept of "line" and "least-square distances" when the ob- 
jects are not Euclidean in nature, and (b) the computational challenge in actually 
finding these "least-fit" objects. The path space concept presents a compelling model 
for facilitating both these kinds of analyses, with the CAT(O) property providing the 
framework for efficient iterative improvement methods to extract useful statistical 
information in this context. 
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